Purpose: Quantification, EDA and normalization for single-cell RNA-Seq data.

This R/Bioconductor package includes:

  1. Wrapper functions for transcript/Gene quantification using pseudoalignments (Salmon or Kallisto) to calculate gene-level expression values
  2. A new SCESet (single-cell ExpressionSet) class in Bioconductor
  3. Automated calculation of QC meterics (feature-level and cell-level); filtering
  4. Data visualization to QC metrics
  5. Normalization and batch correction methods to identify and remove uninteresting covariates
scater Overview

scater Overview

Citations:

Workflows:

To install the scater R/Bioconductor package

source("http://bioconductor.org/biocLite.R")
biocLite("scater")
library(scater)
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## Loading required package: ggplot2
## 
## Attaching package: 'scater'
## The following object is masked from 'package:stats':
## 
##     filter
data("sc_example_counts")
data("sc_example_cell_info")

Pre-process outside of scater

quality control

scater assumes FASTQ files of individual cells have been assessed and removed based on quality control metrics based on bulk RNA-Seq metrics. For example, use fastqc, identify and remove libraries that are heavily degraded, libraries with a large amount of ribosomal, mitochondirla or other RNA type.

The SCESet Data Class in scater

This class holds the expression values. Class structure is derived from the ExpressionSet class in Bioconductor’s Biobase package (derived for microarray and bulk RNA-Seq analyses; allows assay data, gene/transcript metadata, and sample metadata). Adds additional slots to store:

  1. reduced-dimension representation of data for data viz
  2. cell-cell and gene-gene pairwise distance matrices for clustering or regulatory network reconstruction
  3. bootstrapped expression results to assess accuracy for quantification (e.g. from Kallisto)
  4. info about feature/cell controls for normalization (e.g. ERCC spike-ins/control genes), QC, detect highly variable genes, etc
  5. epigenetic info or FACS marker expression data

To create a SCESet object, use newSCESet() and provided three data sets:

You can also provide the normalized expression values without needing to provide the count data

Reading in Data

Example:

# define cell level data set (phenotypic data)
pdMat <- new("AnnotatedDataFrame", data = sc_example_cell_info)
rownames(pdMat) <- pdMat$Cell

# define feature level data set (gene level)
gene_df <- data.frame(Gene = rownames(sc_example_counts))
rownames(gene_df) <- gene_df$Gene
fdMat <- new("AnnotatedDataFrame", data = gene_df)

scMat <- newSCESet(countData = sc_example_counts, phenoData = pdMat, featureData = fdMat)
scMat
## SCESet (storageMode: environment)
## assayData: 2000 features, 40 samples 
##   element names: counts, cpm, exprs, is_exprs 
## protocolData: none
## phenoData
##   sampleNames: Cell_001 Cell_002 ... Cell_040 (40 total)
##   varLabels: Cell Mutation_Status Cell_Cycle Treatment
##   varMetadata: labelDescription
## featureData
##   featureNames: Gene_0001 Gene_0002 ... Gene_2000 (2000 total)
##   fvarLabels: Gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
# to add a different expression matrix e.g. with avg count of 5 to avoid log(0)
exprs(scMat) <- edgeR::cpm.default(counts(scMat), prior.count = 5, log = TRUE)

# or can create a new SCESet object with only expression data 
#   Note: no count data, phenoData or featureData needed, though not recommended.
scMat2 <- newSCESet(exprsData = edgeR::cpm.default(scMat, prior.count = 5, log = TRUE))

Quality Control and Diagnostics

Uses the SCESet object:

Create logical matrix of which features are being expressed

is_exprs(scMat) <- calcIsExprs(scMat, lowerDetectionLimit = 4, 
                               exprs_data = "exprs")
head(is_exprs(scMat))

filter out genes that are not expressed with a lowerDectionLimit of 5

keep_genes <- rowSums(is_exprs(scMat)) > 5
scMat <- scMat[keep_genes, ]

Quality control metrics

The authors state: “Following QC, we can proceed with data normalisation before downstream analysis and modelling.”

Authors split the QC metrics/plots into three steps:

  1. QC and filtering of cells
  2. QC and filtering of features (e.g. genes)
  3. QC of experimental variables

Many of the functions in the QC are plots to explore and identify problems (e.g. features or cells that need to be removed). The function calculateQCMetrics() is used to compute commonly used QC metrics. Controls (features or cells) can be defined.

scMat <- calculateQCMetrics(scMat, feature_controls = 1:92, cell_controls = 1:5)
varLabels(scMat) # lists all QC metrics
##  [1] "Cell"                                 
##  [2] "Mutation_Status"                      
##  [3] "Cell_Cycle"                           
##  [4] "Treatment"                            
##  [5] "total_counts"                         
##  [6] "log10_total_counts"                   
##  [7] "filter_on_total_counts"               
##  [8] "total_features"                       
##  [9] "log10_total_features"                 
## [10] "filter_on_total_features"             
## [11] "pct_dropout"                          
## [12] "exprs_feature_controls"               
## [13] "pct_exprs_feature_controls"           
## [14] "filter_on_pct_exprs_feature_controls" 
## [15] "pct_exprs_top_50_features"            
## [16] "pct_exprs_top_100_features"           
## [17] "pct_exprs_top_200_features"           
## [18] "counts_feature_controls"              
## [19] "pct_counts_feature_controls"          
## [20] "filter_on_pct_counts_feature_controls"
## [21] "pct_counts_top_50_features"           
## [22] "pct_counts_top_100_features"          
## [23] "pct_counts_top_200_features"          
## [24] "n_detected_feature_controls"          
## [25] "exprs_endogenous_features"            
## [26] "counts_endogenous_features"           
## [27] "log10_exprs_feature_controls"         
## [28] "log10_counts_feature_controls"        
## [29] "log10_exprs_endogenous_features"      
## [30] "log10_counts_endogenous_features"     
## [31] "is_cell_control"

QC and filtering cells

The calculateQCMetrics() function adds the following columns to pData(object):

Cell-level QC metric Description
total_counts total number of counts for the cell (aka ‘library size’)
log10_total_counts total_counts on the log10-scale
total_features the number of features for the cell that have expression above the detection limit (default detection limit is zero)
filter_on_total_counts would this cell be filtered out based on its log10-total_counts being (by default) more than 5 median absolute deviations from the median log10-total_counts for the dataset?
filter_on_total_features would this cell be filtered out based on its total_features being (by default) more than 5 median absolute deviations from the median total_features for the dataset?
counts_feature_controls total number of counts for the cell that come from (a set of user-defined) control features. Defaults to zero if no control features are indicated.
counts_endogenous_features total number of counts for the cell that come from endogenous features (i.e. not control features). Defaults to total_counts if no control features are indicated.
log10_counts_feature_controls total number of counts from control features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
log10_counts_endogenous_features total number of counts from endogenous features on the log10-scale. Defaults to zero (i.e. log10(0 + 1), offset to avoid infinite values) if no control features are indicated.
n_detected_feature_controls number of defined feature controls that have expression greater than the threshold defined in the object.
pct_counts_feature_controls percentage of all counts that come from the defined control features. Defaults to zero if no control features are defined.

Other notes about filtering cells:

  • Multiple sets of feature controls can be defined
  • Where counts appear above, similar metrics can be computed for exprs, tpm and fpkm

QC and filtering features

The calculateQCMetrics() function adds the following columns to fData(object):

Feature-level QC metric Description
mean_exprs the mean expression level of the gene/feature.
exprs_rank the rank of the feature’s expression level in the cell.
total_feature_counts the total number of counts mapped to that feature across all cells.
log10_total_feature_counts total feature counts on the log10-scale.
pct_total_counts the percentage of all counts that are accounted for by the counts mapping to the feature.
is_feature_control is the feature a control feature? Default is FALSE unless control features are defined by the user.
n_cells_exprs the number of cells for which the expression level of the feature is above the detection limit (default detection limit is zero).

QC of experimental variables

TBA

Many plot functions available for SCESet object

Many plotting functions are available for visualising the data. Many of the plots create can facet or group the gene expression by factors.

plot()

Gives an overview of expression across cells. Plots cumulative proportion of each cell’s library that is accounted for by the highest-expressed features (default showing top 500 features); shows differences in expression distributions across cells (similar idea to boxplots)

  • plot(object, block1 = "groupFactor1", block2 = "groupFactor2", nfeatures = 500, exprs_values = "counts")
  • uses TPM as default (or will use exprs(object)). Other values can be specified with exprs_values argument (e.g. exprs, tpm and fpkm)
  • block1 and block2 are optional arguments that can create a faceted set of plots separated by groupFactor1 phenotypic factor information in phenoData (e.g. batch). Allows user to see large differences in distributions of expression across experimental blocks, batches, etc.
plot(scMat) 
## Warning in plotSCESet(x, ...): The object does not contain tpm expression
## values. Using exprs(object) values instead.

plot(scMat, block1 = "Mutation_Status", block2 = "Treatment",
     colour_by = "Cell_Cycle", nfeatures = 300, exprs_values = "counts")

plotExpression()

Plot expression levels for a defined set of features

  • plotExpression(object, rownames(scMat)[1:6], x = "groupFactor1", showMedian = FALSE, show_violin = TRUE)
  • the optional x argument splits each feature based on some phenotypic information (e.g. mutation status, batch)
plotExpression(scMat, rownames(scMat)[1:6],
               x = "Mutation_Status", exprs_values = "exprs")

plotExpression(scMat, rownames(scMat)[7:12],
               x = "Mutation_Status", exprs_values = "counts", colour = "Cell_Cycle",
               show_median = FALSE, show_violin = TRUE,  xlab = "Mutation Status",
               log = TRUE)

plotQC()

Methods to produce various QC diagnostic plots.

  • plotQC(object, type = "highest-expression", exprs_values = "tpm")
    • type = "highest-expression": plot the most expressed features across the dataset. Default shows top 50 features.
    • type = "exprs-freq-vs-mean": plot frequency of expression (number of cells with expression for a gene above the defined threshold) vs mean expression level. Gives an idea of technical noise in data set.
  • Note: Vignette states you can use multiplot function to plot multiple ggplot2 plots on the same page.
  • Find the most important PCs for a given cell phenotype or metadata variable (from pData(object))
  • Plot a set of cell phenotype/metadata variables against each other and calculating the (marginal) percentage of feature expression variance that they explain
# filter features
keep_feature <- rowSums(is_exprs(scMat)) > 4
scMatSub <- scMat[keep_feature,]

## Plot QC
plotQC(scMatSub, type = "highest-expression")

plotQC(scMatSub, type = "exprs-freq-vs-mean")

plotFeatureData()

Plot feature metadata and QC metrics. Can identify problematic cells by exploring relationship bewteen QC metrics computed from calculateQCMetrics(). Output is a ggplot2 object.

  • plotFeatureData(object, aes(x = n_cells_exprs, y = pct_total_counts))
plotFeatureData(scMat, aes(x = n_cells_exprs, y = pct_total_counts))

Shows a small number of features that are ubiquitously expressed expressed in all cells (n_cells_exprs) and account for a large proportion of all counts observed (pct_total_counts; more than 0.5% of all counts).

plotPhenoData()

Plot cell metadata and QC metrics. Output is a ggplot2 object.

  • plotPhenoData(object, aes(x = total_counts, y = total_features, colour = groupFactor1))
plotPhenoData(scMat, aes(x = total_counts, y = total_features,
                                  colour = Mutation_Status))

plotPhenoData(scMat, aes(x = Mutation_Status, y = total_features,
                                  colour = log10_total_counts))

plotPCA()

Produce a PCA plot for visualizing the cells. PCs calculated using 500 features with most variable expression across all cells (can be changed with ntop argument) from the exprs slot (but can be changed e.g. exprs_values="cpm").

  • plotPCA(object, ntop=500, exprs_values="exprs"): p
    • feature_set = fData(object)$is_feature_control: if control features are defined, you can use only these to calculate PCs.
    • ncomponents = 4, colour_by = "Treatment", shape_by = "Mutation_Status": allows more than 2 PCs to be plotted and allows phenotypic information to be defined using colour, shapes, size etc.
    • object <- plotPCA(object): the SCESet object has a reducedDimension slot which stores the top PCs. This can be accessed using the reducedDimension() function (e.g. reducedDimension(object)).
plotPCA(scMat)

Other plot functions.

  • plotTSNE(object, colour_by = "Gene_0001", size_by = "Gene_1000"): produce a t-distributed stochastic neighbour embedding (reduced dimension) plot for the cells
  • plotDiffusionMap(object, colour_by = "Gene_0001", size_by = "Gene_1000"): produce a diffusion map (reduced dimension) plot for the cells
  • plotMDS(object): produce a multi-dimensional scaling plot for the cells
  • plotReducedDim(object): plot a reduced-dimension representation of the cells

Plots

Example:

plot(scMat) 

plot_genes_jitter(sc[1:2,], grouping = "groupFactor", ncol = 2)
plot_spanning_tree(sc) # plots the order of the cells using PC1 and PC2. 
plot_genes_in_pseudotime(sc_subset, color_by = "Time") 
  • plot_genes_jitter(object, grouping = "groupFactor", ncol = 2) = Create plots of gene expression grouped by factors (only for a small number of genes). Based on the CellDataSet object.
  • plot_spanning_tree(object) = Plots the minimum spanning tree on cells after applying the orderCells() function.
  • plot_genes_in_pseudotime() = Plots expression for one or more genes as a function of pseudotime.